2026-01-07
Source: Sora
Palaeo data come in many forms
Probably not
Becomes a problem when we start to collate data into a large database
Even within a single proxy we have inconsistent data
Where we’ve sampled is almost surely not random: it is a convenience sample
Almost surely irregularly spaced in time
No one in this room needs to be told any of this
Themes from a brief search of literature (AKA things I’ve been involved with)
Long term changes in spatial organisation and distribution of species and ecosystems
Traditional methods used in palaeo are unlikely to help with analyses that compare across more than two taxonomic groups or data of different types
What do we do if we have different resolution data within a proxy?
Or different data representations?
Or different proxies at different sites / samples?
Can we use all the data?
Source: Jeff Steinke
Lots of developments in the statistical ecology and omics worlds we can take advantage of
integrated SDMs
joint species distribution models
model-based ordination
copula models (marginal models for multivariate responses)
graphical models & networks
…
Not just for the sake of being novel
Newer methods enable estimation of new quantities — new/better answers to questions
Integrated species distribution models
General way to combine — integrate — disparate data
species’ distributions are aggregated spatial locations of all individuals of the same species across a geographical domain
the distribution can be described by a spatial point process, where local intensity (density) of individuals varies
SDMs are a direct or indirect model of this underlying point process
data integration requires linking each data source to the common underlying point process while accounting for differences among data types
A spatial point process describes the distribution of event locations across some spatial domain
Random process generating points, described by the local intensity \(\lambda_{s}\)
\(\lambda_{s}\) — expected density of points at spatial location \(s\)
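If points occur independently of one another and the intensity is the same everywhere, we have a homogeneous Poisson process (\(\lambda_{s} = \lambda \; \forall \; s\)); the number of points in any region is then Poisson distributed with mean \(\lambda\) times the size of the region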
If \(\lambda_{s}\) varies across \(s\), we have an inhomogeneous Poisson process
Other distributions are available
These work in time as well
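To make the inhomogeneous case concrete, here is a minimal sketch that simulates points along a one-dimensional domain (space or time) by thinning a homogeneous process. The intensity function `example_intensity` and the bound `lam_max` are illustrative assumptions, not something taken from the slides.

```python
import math
import random

def simulate_ipp(intensity, lam_max, t_max, rng):
    """Simulate an inhomogeneous Poisson process on [0, t_max] by thinning.

    intensity: function s -> lambda_s, which must never exceed lam_max.
    """
    points = []
    s = 0.0
    while True:
        # Candidate events come from a homogeneous process with rate lam_max
        s += rng.expovariate(lam_max)
        if s > t_max:
            break
        # Keep each candidate with probability lambda_s / lam_max
        if rng.random() < intensity(s) / lam_max:
            points.append(s)
    return points

def example_intensity(s):
    # Hypothetical intensity: density of points rises and falls along the domain
    return 5.0 + 4.0 * math.sin(s)

rng = random.Random(1)
pts = simulate_ipp(example_intensity, lam_max=9.0, t_max=10.0, rng=rng)
print(len(pts), "points; first few:", [round(p, 2) for p in pts[:3]])
```

Setting the intensity to a constant recovers the homogeneous process.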
Miller et al (2019). Methods Ecol. Evol. 10.1111/2041-210X.13110
The different data sets have their own “model” and the likelihoods are combined during fitting
Allows mixing of different types of data
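A toy sketch of the combined-likelihood idea, assuming two independent data sources generated by one constant intensity `lam`: counts in plots of known area, and presence/absence records from other plots. The data are made up, and the grid search stands in for a real optimiser.

```python
import math

def loglik_counts(lam, counts, areas):
    # Count data: count_i ~ Poisson(lam * area_i)
    total = 0.0
    for y, a in zip(counts, areas):
        mu = lam * a
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total

def loglik_presence(lam, present, areas):
    # Presence/absence from the same process:
    # P(at least one point in the plot) = 1 - exp(-lam * area)
    total = 0.0
    for z, a in zip(present, areas):
        p = 1.0 - math.exp(-lam * a)
        total += math.log(p) if z else math.log(1.0 - p)
    return total

def joint_loglik(lam, count_data, pa_data):
    # Independent data sources, so their log-likelihoods simply add
    return loglik_counts(lam, *count_data) + loglik_presence(lam, *pa_data)

# Made-up illustrative data (not from any real study)
count_data = ([3, 5, 2, 4], [1.0, 2.0, 1.0, 1.5])        # counts, plot areas
pa_data = ([1, 0, 1, 1, 0], [0.5, 0.2, 1.0, 0.8, 0.1])   # presence, plot areas

grid = [0.05 * k for k in range(1, 201)]                  # lam from 0.05 to 10
lam_hat = max(grid, key=lambda lam: joint_loglik(lam, count_data, pa_data))
print("joint estimate of lambda:", round(lam_hat, 2))
```

Each data type keeps its own observation model; only the underlying intensity is shared.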
Similar idea to combine likelihoods from different types of data
gfam() family in Simon Wood’s mgcv 📦
Instead of modelling one species at a time and stacking the models, Joint Species Distribution Models estimate all species at once
Ideally we’d combine integrated SDMs with JSDMs, but I’m not aware of much work on this yet (but see Gelfand & Schliep, 2025)
JSDMs can be used to fit model-based ordinations — might have to move away from traditional ordination methods to handle features of our data properly
We don’t have to repeat everything that our non-palaeo ecologist colleagues have worked through already — jump to the head of the line
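One common way to write the JSDM / model-based ordination link is as a latent variable model. The notation here is my own shorthand rather than anything from the cited papers: \(\mu_{ij}\) is the expected abundance of taxon \(j\) in sample \(i\), \(\mathbf{x}_i\) are covariates, and \(\mathbf{z}_i\) are a small number \(d\) of latent variables that play the role of ordination axes.

\[
g(\mu_{ij}) = \beta_{0j} + \mathbf{x}_i^{\top}\boldsymbol{\beta}_j + \mathbf{z}_i^{\top}\boldsymbol{\gamma}_j, \qquad \mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)
\]

The taxon-specific loadings \(\boldsymbol{\gamma}_j\) induce residual correlation between taxa, and plotting the estimated \(\mathbf{z}_i\) gives the ordination.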
What if we don’t have the same proxies measured at the same set of sites? — spatial misalignment
What if proxies represent different amounts of space (time)?
This is covered under the problem of change of support and the concept of data fusion
For larger data sets, computation using these newer methods becomes difficult
more parameters vs. simplifications & approximations
newer methods & algorithms, GPUs, etc are helping with this
Most methods are demonstrated with 10s of taxa; the Galore (Pound & O’Keefe, 2025, Palynology) pollen & spore data set has >1000 genera
Even describing larger data sets presents challenges
Want lower dimensional view of the data — topic models
Summarise each sample as being made up of proportions of \(A \ll m\) “associations” (don’t hate on me!)
“associations” are learned from the data — proportions of each taxon in each “association”
each individual in a sample is modelled as a draw from the distribution of “associations” and then a draw of a taxon from that “association”
display (model?) data using these \(A \ll m\) “associations”
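The generative story described above can be sketched directly. This is a simulation of the assumed data-generating process only, not a fitting routine; the taxon names, the two "associations", and the sample proportions are all hypothetical.

```python
import random

def simulate_sample(n_individuals, sample_props, assoc_taxon_probs, taxa, rng):
    """Generate taxon counts for one sample under a topic-model-style process."""
    counts = {t: 0 for t in taxa}
    assoc_ids = list(range(len(sample_props)))
    for _ in range(n_individuals):
        # 1. Draw an "association" according to this sample's proportions
        a = rng.choices(assoc_ids, weights=sample_props)[0]
        # 2. Draw a taxon from that association's distribution over taxa
        t = rng.choices(taxa, weights=assoc_taxon_probs[a])[0]
        counts[t] += 1
    return counts

# Hypothetical taxa and "associations" (rows: proportions of each taxon)
taxa = ["Pinus", "Betula", "Poaceae", "Artemisia"]
assoc_taxon_probs = [
    [0.60, 0.35, 0.04, 0.01],   # a tree-dominated association
    [0.02, 0.08, 0.50, 0.40],   # an open-ground association
]
sample_props = [0.7, 0.3]       # this sample is 70% association 0, 30% association 1

rng = random.Random(42)
counts = simulate_sample(300, sample_props, assoc_taxon_probs, taxa, rng)
print(counts)
```

Fitting a topic model runs this logic in reverse: given the counts, estimate `sample_props` for every sample and `assoc_taxon_probs` for every "association".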
Source: Giphy
Working with large, disparate, heterogeneous data sets is hard
Using newer statistical approaches is essential to handle this heterogeneity
Some progress has been made — more happening all the time
Diversity metrics are very non-Gaussian
Any modelling of “diversity” needs to handle the sediment accumulation problem
Time averaging different amounts of time per sample leads to
Same problem affects any modelling of any palaeo data, save for annually laminated records…
Effort problems plague “microbiome”-type data
Rare or data-deficient species?
Large training sets — throw out rare species, singletons etc
eDNA — “filtering” throws away a lot of data (& please don’t rarefy to counts)
Hierarchical models involving random “effects” allow us to borrow strength from more data-rich taxa
Sharma et al (in press). No species left behind: borrowing strength to map data-deficient species. Trends Ecol. Evol. 10.1016/j.tree.2025.04.010
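A sketch of what "borrowing strength" means in the simplest Gaussian case, using notation I am introducing for illustration: \(y_{ij}\) is observation \(i\) for taxon \(j\), with \(n_j\) observations of that taxon, and the variances are treated as known.

\[
y_{ij} = \beta_j + \varepsilon_{ij}, \qquad \beta_j \sim \mathcal{N}(\mu, \sigma^2_{\beta}), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2_{\varepsilon})
\]

\[
\hat{\beta}_j = w_j \bar{y}_j + (1 - w_j)\,\mu, \qquad w_j = \frac{\sigma^2_{\beta}}{\sigma^2_{\beta} + \sigma^2_{\varepsilon} / n_j}
\]

For a data-rich taxon \(w_j \to 1\) and its own mean dominates; for a rare taxon with small \(n_j\) the estimate is pulled towards the shared mean \(\mu\), which is informed by all the other taxa.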
Over in the Omics cinematic universe, those folks are doing their own thing integrating disparate kinds of data
Popular techniques are focused around extensions to PLS
Multiple different types of omics analysis on the same samples
If we can’t / don’t want to use these newer methods, what can we do with dissimilarities?
Fused dissimilarities
Then analyse using NMDS or db-RDA, etc.
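One simple construction of a fused dissimilarity, sketched under the assumption that fusing means a weighted average of per-proxy dissimilarity matrices after each has been scaled to a maximum of 1. The pollen counts and geochemistry values are made up, and the choice of Bray-Curtis and Euclidean distances is illustrative.

```python
def bray_curtis(x, y):
    # Bray-Curtis dissimilarity for count data
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den > 0 else 0.0

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def dissimilarity_matrix(rows, dist):
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

def scale_to_unit_max(d):
    # Put every proxy on a common 0-1 scale before combining
    m = max(max(row) for row in d)
    return [[v / m if m > 0 else 0.0 for v in row] for row in d]

def fuse(matrices, weights):
    # Weighted average of the scaled matrices; weights are normalised to sum to 1
    total = sum(weights)
    w = [wt / total for wt in weights]
    scaled = [scale_to_unit_max(d) for d in matrices]
    n = len(scaled[0])
    return [[sum(wk * dk[i][j] for wk, dk in zip(w, scaled)) for j in range(n)]
            for i in range(n)]

# Made-up data: the same three samples measured with two proxies
pollen = [[120, 30, 50], [100, 45, 55], [20, 90, 90]]   # counts per taxon
geochem = [[1.2, 0.4], [1.1, 0.5], [2.3, 1.9]]          # element concentrations

d_fused = fuse(
    [dissimilarity_matrix(pollen, bray_curtis),
     dissimilarity_matrix(geochem, euclidean)],
    weights=[1, 1],
)
for row in d_fused:
    print([round(v, 3) for v in row])
```

The fused matrix can then be passed to whatever dissimilarity-based method is preferred; unequal weights let one proxy count for more than another.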
Very hard to say taxon x extirpated from this lake at this time
Most palaeo data are presence only
Possible with associated marks — abundance or biomass conditional upon the taxon being found
We don’t know (statistical) things about the taxa we don’t observe
Hard to put a probability on (e.g.) extirpation with these data
But ecologists have been doing this kind of work for decades — occupancy modelling
Most methods require repeated sampling
What would that look like for palaeo?
Could we count same number of things but over \(n \geq 2\) different “samples”?
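To show why repeated sampling matters, here is a minimal sketch of the basic occupancy likelihood with constant occupancy probability `psi` and detection probability `p`. The detection histories are invented: think of each inner list as repeat counts from one sediment sample, with 1 meaning the taxon was found in that count. The grid search is a stand-in for a proper optimiser.

```python
import math

def site_loglik(psi, p, history):
    k = len(history)          # number of repeat "samples"
    d = sum(history)          # number of repeats in which the taxon was found
    present_term = psi * (p ** d) * ((1 - p) ** (k - d))
    if d > 0:
        # Found at least once, so the taxon must be present
        return math.log(present_term)
    # Never found: present but missed every time, or truly absent
    return math.log(present_term + (1 - psi))

def loglik(psi, p, histories):
    return sum(site_loglik(psi, p, h) for h in histories)

# Made-up detection histories: 6 samples, 3 repeat counts each
histories = [
    [1, 0, 1], [0, 0, 0], [1, 1, 0],
    [0, 0, 0], [0, 1, 0], [0, 0, 0],
]

grid = [k / 100 for k in range(1, 100)]
psi_hat, p_hat = max(
    ((psi, p) for psi in grid for p in grid),
    key=lambda th: loglik(th[0], th[1], histories),
)
naive = sum(1 for h in histories if any(h)) / len(histories)
print("naive occupancy:", naive, "| psi_hat:", psi_hat, "| p_hat:", p_hat)
```

The all-zero histories are what the repeats buy us: they let the model separate "absent" from "present but not detected", which a single count per sample cannot do.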